
Conversation

@whybe-choi (Contributor) commented Oct 7, 2025

Close #3268

This pull request adds new BRIGHT subset benchmarks and their corresponding descriptive statistics to the retrieval benchmark suite. These changes enable more granular, domain-specific evaluation for reasoning-intensive retrieval tasks, both for standard and long document formats.

Benchmark additions

  • Introduced two new benchmarks, BRIGHT_SUBSETS and BRIGHT_SUBSETS_LONG, to the mteb/benchmarks/benchmarks/benchmarks.py file, covering individual domains of the BRIGHT benchmark for both standard and long document retrieval tasks.
  • Registered the new benchmarks in the mteb/benchmarks/benchmarks/__init__.py file for import and usage.

Descriptive statistics

  • Added descriptive statistics JSON files for each new BRIGHT subset retrieval task, including both standard and long formats (e.g., BrightBiologyRetrieval.json, BrightBiologyLongRetrieval.json, etc.), detailing sample counts, text lengths, and relevant document statistics for each domain.

Minor improvement

  • Minor formatting fix in the BEIR_NL benchmark description for improved readability.

@whybe-choi changed the title to "refactor: split BRIGHT benchmark into individual subset tasks" on Oct 7, 2025
@Samoed requested a review from Muennighoff on October 7, 2025 13:40
@whybe-choi force-pushed the bright-subset-tasks branch from 4240bdb to 826990a on October 7, 2025 14:36
@KennethEnevoldsen (Contributor) left a comment (marked as resolved)

Hmm, this change will invalidate all previous results on BRIGHT.

You know that you can also simply subselect from a task using:

task = mteb.get_task("BrightRetrieval", eval_splits=..., hf_subsets=...)
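A slightly fuller sketch of that subselection route (illustrative only; parameter spellings and subset/split names follow the snippet above and recent mteb conventions, and may differ by version):

import mteb

# Evaluate only one subset of the existing BrightRetrieval task
# ("standard"/"long" are the BRIGHT eval splits; "biology" is an example subset).
task = mteb.get_task("BrightRetrieval", eval_splits=["standard"], hf_subsets=["biology"])
model = mteb.get_model("BAAI/bge-large-en-v1.5")
mteb.MTEB(tasks=[task]).run(model, output_folder="results")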

For the leaderboard display it is even possible to create custom summary tables (see e.g. #3272)

@Samoed (Member) commented Oct 7, 2025

You know that you can also simply subselect from a task using:

Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We can add support for configuring prompts for different subsets, but I'm not sure if that's a good idea
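For illustration, per-subset prompt configuration might look something like the following key scheme (purely hypothetical, not current mteb behaviour; mteb currently resolves prompts per task and per query/document, not per subset):

# Hypothetical: model prompts keyed by task, subset and prompt type.
model_prompts = {
    "BrightRetrieval-biology-query": "Given a biology post, retrieve relevant passages that help answer the post",
    "BrightRetrieval-economics-query": "Given a economics post, retrieve relevant passages that help answer the post",
}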

@KennethEnevoldsen (Contributor)

Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We can add support for configuring prompts for different subsets, but I'm not sure if that's a good idea

Ohh... Yeah that is hard to fix.

I see that the original BRIGHT (long) only has four models and BRIGHT only has 12, so I guess it is possible to rerun them.

@Muennighoff (Contributor)

If the scores change, are the new scores more similar to or more different from the official scores? If closer, then I think it is fine, and maybe we can rerun some models. For many of the models on our BRIGHT leaderboard I just converted the scores from https://brightbenchmark.github.io/ to MTEB format when we originally added them, so they may still be fine if these changes actually bring our implementation closer to that one.

@Samoed added the "new dataset" label Oct 7, 2025
@whybe-choi (Contributor, Author)

Would it be enough to evaluate the performance of ReasonIR, or is there a list of other models that should be tested?

@Samoed (Member) commented Oct 8, 2025

To check the implementation, this will be enough; just don't update the old leaderboard.

@whybe-choi force-pushed the bright-subset-tasks branch from 826990a to 3ed620f on October 8, 2025 11:33
@whybe-choi force-pushed the bright-subset-tasks branch from 3ed620f to 57c757f on October 8, 2025 11:53
@whybe-choi (Contributor, Author)

After splitting BrightRetrieval into multiple tasks, I ran ReasonIR on them with task-specific prompts using the following code:

import torch
import mteb

# https://github.com/facebookresearch/ReasonIR/tree/main/evaluation/bright/configs/reasonir
prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

tasks = mteb.get_tasks(tasks=list(prompts_dict.keys()), languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)

evaluation.run(
    model,
    save_predictions=True,
    output_folder="evaluation/results",
    encode_kwargs={"batch_size": 1},
)

The results are as follows:

|              | Bio.  | Earth. | Econ. | Psy.  | Rob.  | Stack. | Sus.  | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg.  |
|--------------|-------|--------|-------|-------|-------|--------|-------|-------|------|------|--------|--------|-------|
| before split | 24.31 | 30.83  | 24.27 | 28.95 | 18.40 | 21.68  | 20.57 | 18.14 | 9.49 | 4.84 | 18.21  | 26.42  | 20.51 |
| after split  | 26.18 | 30.71  | 23.96 | 29.76 | 18.62 | 21.15  | 19.89 | 19.65 | 9.22 | 5.12 | 18.34  | 27.12  | 20.81 |

In the paper:
[screenshot of the corresponding ReasonIR results table from the paper]

@Samoed (Member) commented Oct 9, 2025

Great results! But I'm a bit unsure whether the prompts are applied correctly when they're passed through get_model.

@whybe-choi (Contributor, Author)

This is the relevant part of the wrapper's encode path, where the instruction is passed as the prompt:

if instruction:
    logger.info(f"Using instruction: '{instruction}' for task: '{task_name}'")
embeddings = self.model.encode(
    sentences,
    prompt=instruction,
    **kwargs,
)
if isinstance(embeddings, torch.Tensor):
    # sometimes kwargs can contain return_tensors=True
    embeddings = embeddings.cpu().detach().float().numpy()
return embeddings

After adding code to print the formatted instruction, the following output was produced:

# Biology
Retrieval
    - BrightBiologyRetrieval, s2p


instruction: <|user|>
Given a Biology post, retrieve relevant passages that help answer the post<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:06<00:00, 15.80it/s]
instruction: <|embed|>

Batches:   0%|                                                                                            | 2/50000 [00:02<18:01:38,  1.30s/it
# Psychology
Retrieval
    - BrightPsychologyRetrieval, s2p


instruction: <|user|>
Given a Psychology post, retrieve relevant passages that help answer the post<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 101/101 [00:07<00:00, 14.12it/s]
instruction: <|embed|>

Batches:   0%|                                                                                                       | 0/50000 [00:01<?, ?it/s]
# Aops
Retrieval
    - BrightAopsRetrieval, s2p


instruction: <|user|>
Given a Math problem, retrieve relevant examples that help answer the problem<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 111/111 [00:06<00:00, 16.13it/s]
instruction: <|embed|>

Batches:   0%|                                                                                            | 17/50000 [00:09<7:16:33,  1.91it/s]

@Samoed (Member) commented Oct 9, 2025

Interesting, thanks! I didn’t think that would work since it’s a bit unintended, but maybe we should update the code to handle this case.

I've checked the ReasonIR code and found some other places that may help with reproduction:

  1. In some cases, the rewritten query is concatenated with the original query: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L82-L87
  2. Sometimes reasoning traces are added to the query: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L124
  3. Maybe IDs can be filtered (ref Excluded IDs missing from BRIGHT dataset #2696), but in the ReasonIR code they just check that no IDs intersect: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L130-L131 (see the sketch after this list)
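For point 3, a minimal sketch of what filtering those excluded IDs could look like on our side (the function and variable names are illustrative, not existing mteb API):

def filter_excluded_ids(
    results: dict[str, dict[str, float]],
    excluded_ids: dict[str, list[str]],
) -> dict[str, dict[str, float]]:
    """Drop excluded corpus ids from retrieved results before scoring.

    Mirrors the ReasonIR check that excluded ids do not intersect the qrels.
    """
    filtered: dict[str, dict[str, float]] = {}
    for query_id, doc_scores in results.items():
        banned = set(excluded_ids.get(query_id, []))
        filtered[query_id] = {
            doc_id: score
            for doc_id, score in doc_scores.items()
            if doc_id not in banned
        }
    return filtered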

@Muennighoff Can you advise on what we can do to reproduce the results?

@Muennighoff (Contributor)

I think the IDs filtering is probably the main missing piece to fully reproduce results?

@whybe-choi (Contributor, Author)

I think points 1 and 2 are a separate issue, as they are related to query expansion. The problem of the performance not being reproducible with the plain ReasonIR model seems to be related to the issue mentioned in point 3.

Samoed and others added 4 commits October 20, 2025 21:56
# Conflicts:
#	mteb/benchmarks/benchmarks/__init__.py
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/retrieval/eng/BrightSubsetsLongRetrieval.py
#	mteb/tasks/retrieval/eng/BrightSubsetsRetrieval.py
@whybe-choi (Contributor, Author)

@Samoed

I think it would be better to close this PR and work on it later together with Excluded IDs missing from BRIGHT dataset #2696. Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?

@Samoed (Member) commented Oct 22, 2025

I think it would be better to close this PR and work on it later together

Do you mean that you don't want the tasks in this PR and will add another PR for #2696?

Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?

Yes, you need to add the statistics to merge. To apply the v2 format, you can select subsets from https://huggingface.co/datasets/mteb/BrightRetrieval, but the retrieval dataset loader requires the dataset to have strictly corpus, qrels and queries, so maybe we need to reupload them instead.
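For concreteness, such a "strict" layout would let each subset be loaded roughly like this (hypothetical repo and config names for a reuploaded per-subset dataset; the current mteb/BrightRetrieval uses different configs):

import datasets

# Hypothetical reuploaded dataset with the three configs the v2 retrieval
# loader expects (corpus / queries / qrels); repo and split names are illustrative.
repo = "mteb/BrightBiologyRetrieval"
corpus = datasets.load_dataset(repo, "corpus", split="standard")
queries = datasets.load_dataset(repo, "queries", split="standard")
qrels = datasets.load_dataset(repo, "qrels", split="standard")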

@whybe-choi (Contributor, Author)

What tasks need to be redone for this PR? I'm confused about the changes with the v2 format, so I would appreciate your help.

@Samoed (Member) commented Oct 22, 2025

I think we can solve #2696 in this PR, because otherwise we would need to create v2 versions of these tasks, which I don't think is a good solution.

Comment on lines +22 to +38
domain_corpus_long = datasets.load_dataset(
    path,
    "long_documents",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
examples = datasets.load_dataset(
    path,
    "examples",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
corpus["long"] = {e["id"]: {"text": e["content"]} for e in domain_corpus_long}
queries["long"] = {e["id"]: e["query"] for e in examples}
relevant_docs["long"] = defaultdict(dict)
Review comment (Member):

To follow the v2 format, you can remove the conversion of the dataset to a dict and pass the dataset directly.

domain_corpus_long = domain_corpus_long.rename_column("content", "text")
queries = queries.rename_column("query", "text")
...
return domain_corpus_long, queries, relevant_docs

if self.data_loaded:
    return

self.corpus, self.queries, self.relevant_docs = load_bright_long_data(
Review comment (Member):

And then here it should look like

self.dataset["default"]["long"]["corpus"], self.dataset["default"]["long"]["queries"], self.dataset["default"]["long"]["relevant_documents"]

You can refer to:

class RetrievalSplitData(TypedDict):
    """A dictionary containing the corpus, queries, relevant documents, instructions, and top-ranked documents for a retrieval task.

    Attributes:
        corpus: The corpus dataset containing documents. Should have columns `id`, `title`, `text` or `image`.
        queries: The queries dataset containing queries. Should have columns `id`, `text`, `instruction` (for instruction retrieval/reranking) or `image`.
        relevant_docs: A mapping of query IDs to relevant document IDs and their relevance scores. Should have columns `query-id`, `corpus-id`, `score`.
        top_ranked: A mapping of query IDs to a list of top-ranked document IDs. Should have columns `query-id`, `corpus-ids` (list[str]). This is optional and used for reranking tasks.
    """

    corpus: CorpusDatasetType
    queries: QueryDatasetType
    relevant_docs: RelevantDocumentsType
    top_ranked: TopRankedDocumentsType | None
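Putting both suggestions together, the end of load_data might look roughly like this (a sketch only, assuming the v2 loader accepts the datasets directly and the RetrievalSplitData keys above; not the exact implementation):

def load_data(self, **kwargs) -> None:
    if self.data_loaded:
        return

    # datasets with renamed columns, per the suggestion above
    corpus, queries, relevant_docs = load_bright_long_data(...)

    self.dataset = {
        "default": {       # single hf_subset
            "long": {      # eval split
                "corpus": corpus,
                "queries": queries,
                "relevant_docs": relevant_docs,
                "top_ranked": None,
            }
        }
    }
    self.data_loaded = True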

@Samoed (Member) commented Oct 24, 2025

Great! So for now the most different task is Pony?

@whybe-choi (Contributor, Author) commented Oct 24, 2025

Among the tasks with excluded_ids, pony seems to be the most different. The other tasks seem to have reproduced the performance reported in the paper to some extent.

@Samoed (Member) commented Oct 24, 2025

| task              | Paper | PR   | Diff |
|-------------------|-------|------|------|
| Aops              | 14.7  | 15.6 | +0.9 |
| Biology           | 26.2  | 26.1 | -0.1 |
| Economics         | 23.3  | 24.0 | +0.7 |
| Pony              | 10.5  | 9.3  | -1.2 |
| Robotics          | 18.0  | 18.6 | +0.6 |
| StackOverflow     | 23.9  | 21.1 | -2.8 |
| TheoremQAQuestion | 31.9  | 30.1 | -1.8 |
| TheoremQATheorem  | 27.2  | 26.5 | -0.7 |

I think the main difference is because you've evaluated the shots version of the datasets, but it's hard to tell how the scores in the paper table were produced. @Muennighoff Can you help with reproducing the scores?

@Muennighoff (Contributor)

Scores are looking really close, great work. Are you asking me whether in the paper they were evaluated with shots or without?

@Samoed (Member) commented Oct 24, 2025

were evaluated with shots or without?

Yes

@Muennighoff (Contributor)

Yeah I think those specific paper results are zero-shot

@whybe-choi (Contributor, Author)

I set max_seq_length to 32768 based on the following reference. Is this correct?
https://github.com/facebookresearch/ReasonIR/blob/main/evaluation/bright/retrievers.py#L725-L726

@Muennighoff (Contributor)

Yeah that seems right to me (cc'ing @RulinShao in case she has thoughts on if we're missing sth for full reproduction or scores seem close enough)

@whybe-choi (Contributor, Author)

I made a mistake by omitting a newline (\n) before <|embed|> in the query instruction. After correcting this and re-evaluating the performance, the score for the biology task was 26.87 (previously it was 26.1). Since this change is likely to affect the performance of other tasks as well, I will rerun the experiments and attach the updated results accordingly.

Also, I would like to ask if it would be better to handle this modification as a new PR.

def instruction_template(
    instruction: str, prompt_type: PromptType | None = None
) -> str:
    return (
        # https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/configs/reasonir/economics.json#L3
        f"<|user|>\n{instruction}<|embed|>\n"
        if (prompt_type is None or prompt_type == PromptType.query) and instruction
        else "<|embed|>\n"
    )

# https://github.com/facebookresearch/ReasonIR/blob/main/evaluation/bright/configs/reasonir/biology.json
{
  "instructions": {
    "query": "<|user|>\nGiven a {task} post, retrieve relevant passages that help answer the post\n<|embed|>\n",
    "document": "<|embed|>\n"
  },
  "instructions_long": {
    "query": "<|user|>\nGiven a {task} post, retrieve relevant documents that help answer the post\n<|embed|>\n",
    "document": "<|embed|>\n"
  }
}
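For reference, a corrected template matching the config above would presumably look like this (a sketch mirroring the function shown earlier; the actual fix is left to the separate PR discussed below):

def instruction_template(
    instruction: str, prompt_type: PromptType | None = None
) -> str:
    # Adds the missing "\n" before <|embed|>, matching the ReasonIR configs.
    return (
        f"<|user|>\n{instruction}\n<|embed|>\n"
        if (prompt_type is None or prompt_type == PromptType.query) and instruction
        else "<|embed|>\n"
    )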

@Samoed (Member) commented Oct 28, 2025

I think it would be better to make the fix in a separate PR.

@whybe-choi (Contributor, Author)

When I look at the original repository, it seems like the TASK_MAP variable exists but is not being used. Because of this, the task names included in the instruction are slightly different (e.g., Biology -> biology, Sustainable Living -> sustainable_living). I re-evaluated the performance to reflect this change.

The performance was measured based on the following code:

import torch
import mteb
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

prompts_dict = {
    "BrightBiologyRetrieval": "Given a biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a earth_science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a sustainable_living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}


model_path = "ReasonIR/ReasonIR-8B"
model_name = model_path.split("/")[-1]

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    max_seq_length=32768,
    prompts_dict=prompts_dict,
)
cache_dir = "evaluation/cache/bright"
for task_name in prompts_dict.keys():
    print(f"task: {task_name}")
    tasks = mteb.get_tasks(tasks=[task_name], languages=["eng"])
    cache = mteb.cache.ResultCache(cache_dir)

    try:
        mteb.evaluate(
            model,
            tasks,
            cache=cache,
            overwrite_strategy="only-missing",
            prediction_folder=f"{cache_dir}/predictions/{model_name.replace('/', '__')}",
            encode_kwargs={"batch_size": 1},
        )
        print(f"{task_name} completed successfully")
        torch.cuda.empty_cache()

    except torch.cuda.OutOfMemoryError:
        print(f"{task_name} skipped due to OOM error")
        torch.cuda.empty_cache()
        continue

The performance differences are as follows:

| task              | Paper | PR   | Diff |
|-------------------|-------|------|------|
| Aops              | 14.7  | 14.7 | 0    |
| Biology           | 26.2  | 26.3 | +0.1 |
| Economics         | 23.3  | 23.8 | +0.5 |
| Pony              | 10.5  | 10.0 | -0.5 |
| Robotics          | 18.0  | 18.1 | +0.1 |
| StackOverflow     | 23.9  | 20.6 | -3.3 |
| TheoremQAQuestion | 31.9  | 29.8 | -2.1 |
| TheoremQATheorem  | 27.2  | 26.7 | -0.5 |

I'm not sure if the problem is with torch_dtype.

@Samoed (Member) commented Oct 30, 2025

Interestingly, the score on most tasks dropped: AoPS by 0.9, StackOverflow by 0.5.

@whybe-choi You can try to reproduce the scores for bge-large-en-v1.5, all-mpnet-base-v2, or bm25 from the BRIGHT paper https://arxiv.org/pdf/2407.12883, but I'm not sure whether they reported the short or long versions there either.

UPD: In the BRIGHT paper I think the main table reports the short version, because the long version is in Table 39.

@whybe-choi (Contributor, Author) commented Oct 30, 2025

To reproduce results for bge-large-en-v1.5, I conducted the evaluation based on the following code:

import torch
import mteb
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

prompts_dict = {
    "BrightBiologyRetrieval-query": "Represent this biology post for searching relevant passages: ",
    "BrightEarthScienceRetrieval-query": "Represent this earth_science post for searching relevant passages: ",
    "BrightEconomicsRetrieval-query": "Represent this economics post for searching relevant passages: ",
    "BrightPsychologyRetrieval-query": "Represent this psychology post for searching relevant passages: ",
    "BrightRoboticsRetrieval-query": "Represent this robotics post for searching relevant passages: ",
    "BrightStackoverflowRetrieval-query": "Represent this stackoverflow post for searching relevant passages: ",
    "BrightSustainableLivingRetrieval-query": "Represent this sustainable_living post for searching relevant passages: ",
    "BrightPonyRetrieval-query": "Represent this Pony question for searching relevant passages: ",
    "BrightLeetcodeRetrieval-query": "Represent this Coding problem for searching relevant examples: ",
    "BrightAopsRetrieval-query": "Represent this Math problem for searching relevant examples: ",
    "BrightTheoremQATheoremsRetrieval-query": "Represent this Math problem for searching relevant theorems: ",
    "BrightTheoremQAQuestionsRetrieval-query": "Represent this Math problem for searching relevant examples: ",
}


model_path = 'BAAI/bge-large-en-v1.5'
model_name = model_path.split("/")[-1]

model = mteb.get_model(
    model_path,
    model_kwargs={"torch_dtype": torch.float32},
    tokenizer_kwargs={"max_seq_length": 512},
    model_prompts=prompts_dict,
)
cache_dir = "evaluation/cache/bright_v2"
for task_name in prompts_dict.keys():
    task_name = task_name.split("-")[0]
    print(f"task: {task_name}")
    tasks = mteb.get_tasks(tasks=[task_name], languages=["eng"])
    cache = mteb.cache.ResultCache(cache_dir)

    try:
        mteb.evaluate(
            model,
            tasks,
            cache=cache,
            overwrite_strategy="only-missing",
            prediction_folder=f"{cache_dir}/predictions/{model_name.replace('/', '__')}",
            encode_kwargs={"batch_size": 1},
        )
        print(f"✅ {task_name} completed successfully")
        torch.cuda.empty_cache()

    except torch.cuda.OutOfMemoryError:
        print(f"⚠️ {task_name} skipped due to OOM error")
        torch.cuda.empty_cache()
        continue

The results are as follows:

| task               | Paper | PR   | Diff |
|--------------------|-------|------|------|
| Biology            | 11.7  | 12.0 | +0.3 |
| Earth Science      | 24.6  | 24.2 | -0.4 |
| Economics          | 16.6  | 16.6 | 0    |
| Psychology         | 17.5  | 17.5 | 0    |
| Robotics           | 11.7  | 12.2 | +0.5 |
| Stackoverflow      | 10.8  | 9.5  | -1.3 |
| Sustainable Living | 13.3  | 13.3 | 0    |
| Leetcode           | 26.7  | 26.7 | 0    |
| Pony               | 5.7   | 5.6  | -0.1 |
| AoPS               | 6.0   | 6.1  | +0.1 |
| TheoremQAQuestion  | 13.0  | 12.6 | -0.4 |
| TheoremQATheorem   | 6.9   | 5.5  | -1.4 |

@Samoed (Member) commented Oct 30, 2025

@whybe-choi You didn't add the code

@Samoed (Member) commented Oct 30, 2025

I will try to rerun bge from the BRIGHT repo.

@Samoed (Member) commented Oct 31, 2025

I ran bge on earth, biology and pony and got the same results as in the paper.

@whybe-choi (Contributor, Author)

Did I miss anything when evaluating the bge model using mteb?

@Samoed (Member) commented Oct 31, 2025

I don't know for now; I will try to dig deeper over the weekend.

@Samoed (Member) commented Nov 5, 2025

I tried to debug it, but I'm getting slightly different embeddings for the same texts. For example, with mteb I'm getting:

Represent this Pony question for searching relevant passages: I will use the programming language pony.
Problem:
You are given two strings word1 and word2. Merge the strings by adding letters in alternating order, starting with word1. If a string is longer than the other, append the additional letters onto the end of the merged string. Write a funtion that returns the merged string.

Here is the code template:
fun mergeAlternately(word1: String, word2: String): String =>
...

[ 0.03160078  0.00635942  0.01886853 ... -0.01480357 -0.00048236
 -0.02248559]

But with the BRIGHT repo I get:

Represent this Pony question for searching relevant passages: I will use the programming language pony.
Problem:
You are given two strings word1 and word2. Merge the strings by adding letters in alternating order, starting with word1. If a string is longer than the other, append the additional letters onto the end of the merged string. Write a funtion that returns the merged string.

Here is the code template:
fun mergeAlternately(word1: String, word2: String): String =>
...

[ 0.03159456  0.00634383  0.01886596 ... -0.01478089 -0.00048517
 -0.02248857]

I even added normalize_embeddings=True to encode and used the same dependencies as in the BRIGHT repo, but I'm still getting this difference. https://github.com/xlang-ai/BRIGHT/blob/d99e8391d967d4c2b3a74732530d2309e2fc92b6/retrievers.py#L243

This leads to a difference in the resulting similarities, e.g. for ["0"]["Pony/src-math-is_prime-_0.txt"]:

0.7551534175872803 # BRIGHT
0.7552036643028259 # mteb

So I think the scores are as close as possible and good enough to integrate. I don't know how to reproduce this fully. NDCG@1 for Pony is the same.
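For reference, the gap in the reported embedding components is tiny (a quick check using the first and last values printed above):

import numpy as np

# First and last reported components of the Pony query embedding (mteb vs. BRIGHT repo)
mteb_emb = np.array([0.03160078, 0.00635942, 0.01886853, -0.01480357, -0.00048236, -0.02248559])
bright_emb = np.array([0.03159456, 0.00634383, 0.01886596, -0.01478089, -0.00048517, -0.02248857])
print(np.max(np.abs(mteb_emb - bright_emb)))  # ~2e-05, i.e. ordinary numeric drift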

@whybe-choi (Contributor, Author)

I'm sorry to hear that. Is there any additional work I need to do on this PR?

@Samoed (Member) commented Nov 5, 2025

I think you can add prompts to the tasks and write better descriptions for them.

@whybe-choi (Contributor, Author)

I think it's tricky to add a prompt because the format of the prompt varies for each model. For example, each model uses the following prompt for BrightBiologyRetrieval:

  • bge-large-en-v1.5: Represent this biology post for searching relevant passages:
  • ReasonIR: Given a biology post, retrieve relevant passages that help answer the post

@Samoed (Member) commented Nov 5, 2025

I think you can add a prompt like the one for bge.

Successfully merging this pull request may close these issues: Results on BRIGHT not matching